Throughout, we will use the following R packages: INLA (for model fitting), rgdal (includes the readOGR() function for reading .shp files, which come bundled with accompanying .shx and .dbf files), spdep, maptools, and ggplot2.

In this analysis, we will study a dataset that records the number of people from each NYC ZIP code who were hospitalized for (potential) COVID-19 each day in March 2020. This data comes from the NYC Department of Health and Mental Hygiene; my colleagues and I study it in this research paper.
# Read the COVID hospitalization data from a CSV, specifying that the first column
# contains the "row names" for this dataset (these are the ZIP codes). The
# 'check.names' option tells R to not modify the column names, which are dates.
NYC_COVID <- read.csv("Data/ZipcodeHospitalizations.csv", row.names=1, check.names=F)
# Save the dates for future use
dates <- as.Date(colnames(NYC_COVID))
# Print first 5 rows and first 7 columns of the data
head(NYC_COVID, n=c(5,7))
## 2020-03-01 2020-03-02 2020-03-03 2020-03-04 2020-03-05 2020-03-06
## 10001 10 11 6 11 8 5
## 10002 9 19 23 17 15 18
## 10003 3 8 9 6 7 9
## 10004 0 0 3 0 2 1
## 10005 0 0 0 4 0 2
## 2020-03-07
## 10001 7
## 10002 21
## 10003 7
## 10004 0
## 10005 0
Next we read in the shapefile that defines ZIP code boundaries for NYC. The shapefile can be downloaded from NYC Open Data at this link.
# Load the rgdal package, which contains the readOGR function for reading shapefiles
library(rgdal)
# Read NYC shapefile
NYC_SHP <- readOGR("Shapefiles/ZIP_CODE_040114.shp")
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/saravenkatraman/Library/Mobile Documents/com~apple~CloudDocs/Documents/Cornell University (PhD)/Presentations/Spatial Modeling/Shapefiles/ZIP_CODE_040114.shp", layer: "ZIP_CODE_040114"
## with 263 features
## It has 12 fields
It looks like the shapefile has data about 263 ZIP codes in NYC whereas the COVID dataset has only 173, a few of which are actually outside the city. Before we proceed, we’ll restrict both the shapefile and the COVID dataset to include the same ZIPs.
# Find which ZIP codes are in both the NYC shapefile and the COVID dataset.
# Recall that the row names of the COVID dataset are ZIPs.
zips <- intersect(NYC_SHP$ZIPCODE, rownames(NYC_COVID))
# Cut down the shapefile to just these ZIPs
NYC_SHP <- NYC_SHP[NYC_SHP$ZIPCODE %in% zips, ]
# Cut down the COVID dataset to just these ZIPs. Note that this also reorders
# the rows to match the order of the ZIPs in the shapefile.
NYC_COVID <- NYC_COVID[zips, ]
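The reordering noted in the comment above works because intersect() returns the common elements in the order of its *first* argument, which here is the shapefile. A toy illustration with made-up ZIP orderings:

```r
# intersect() preserves the order of its first argument, so 'zips' follows
# the shapefile's ordering and NYC_COVID[zips, ] is reordered to match it
shp_order  <- c("11213", "11212", "10001")  # hypothetical shapefile order
covid_rows <- c("10001", "11212", "11213")  # hypothetical COVID row names
intersect(shp_order, covid_rows)
## [1] "11213" "11212" "10001"
```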
We are left with 169 ZIP codes, but the shapefile still contains 178 rows. This is because there is duplicate information for a few of the ZIPs. We’ll remove these duplicates:
NYC_SHP <- NYC_SHP[! duplicated(NYC_SHP$ZIPCODE), ]
It turns out that this shapefile also contains the populations of each ZIP code, which we’ll need for our analyses. We’ll save this information in its own dataframe.
# Save populations of each ZIP code into a dataframe
NYC_Populations <- data.frame(pop = NYC_SHP$POPULATION, row.names = zips)
# Print out the first few rows
head(NYC_Populations)
## pop
## 11213 62426
## 11212 83866
## 11225 56527
## 11218 72280
## 11226 106132
## 11219 92561
The ZIP code 10162 is missing population data, so we will fill it in using 2020 Census data.
NYC_Populations[which(rownames(NYC_Populations) == "10162"), "pop"] <- 1814
The ggplot2 package can be used to visualize spatial
data on maps, and the plotly package can be used to make
the plot interactive, i.e. to display information upon hovering or
clicking on the plot. First we’ll draw a map of the total number of
hospitalizations from each ZIP code.
Note: We should probably plot the hospitalization rate
instead, e.g. the fraction of each ZIP’s population that was
hospitalized, but we’ll stick to raw counts just for demonstration. To
get the rate, we would divide rowSums(NYC_COVID) by
NYC_Populations$pop in the code below.
# Load ggplot2 package for plotting, plotly package for interactive plotting,
# and viridis package for color palettes
library(ggplot2)
library(plotly)
library(viridis)
library(maptools)
# Compute cumulative hospitalization counts
COVID_cumulative <- data.frame(ZIP = zips, Count = rowSums(NYC_COVID))
# Convert spatial polygons from the shapefile into a dataframe that gives the
# latitude and longitude for each point defining the ZIP code boundaries
# plotData <- fortify(NYC_SHP, region = "ZIPCODE")
plotData <- broom::tidy(NYC_SHP)
# Add an ID column to cumulative COVID dataset to enable merging with plotData
COVID_cumulative$ID <- unique(plotData$id)
# Merge hospitalization counts with map data. Note that in the plotData dataframe,
# the ZIP codes are in the 'id' column.
plotData <- merge(plotData, COVID_cumulative, by.x = "id", by.y = "ID")
# Construct map: id (ZIP code) is the text that'll be shown when hovering over plot
map1 <- ggplot(plotData, aes(x = long, y = lat, text = ZIP)) +
# Draw polygons colored according to the hosp. count, with a black border
geom_polygon(aes(group = group, fill = Count), color = "black", size = 0.2) +
# Specify purple color palette, with darker shades indicating higher counts
scale_fill_distiller(palette = "Purples", direction = 1) +
# Set plot title
ggtitle("Total COVID-related hospitalizations, March 2020") +
# Remove background and axes, and center the plot title
theme_void() + theme(plot.title = element_text(hjust = 0.5))
# Draw the interactive map
ggplotly(map1)
To visualize the temporal dynamics of disease spread, we can look at the cumulative hospitalizations at several intermediate points in time. Below, we plot the cumulative hospitalization counts after each week in March 2020, i.e. after 8, 15, 23, and 30 days.
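The sapply() call below applies rowSums() to a growing window of columns. A minimal check of this pattern on a toy matrix (values are made up):

```r
# Each element of c(2, 3) gives a window width; sapply() binds the
# resulting row sums into one matrix, with one column per window
m <- matrix(1:6, nrow = 2)  # rows play the role of ZIPs, columns of days
sapply(c(2, 3), function(x) rowSums(m[, 1:x]))
##      [,1] [,2]
## [1,]    4    9
## [2,]    6   12
```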
# Compute cumulative hospitalization counts after 8, 15, 23, 30 days
COVID_cumulative <- data.frame(ZIP = zips, sapply(c(8, 15, 23, 30), function(x) rowSums(NYC_COVID[, 1:x])))
colnames(COVID_cumulative)[-1] <- c("Week 1", "Week 2", "Week 3", "Week 4")
# Convert spatial polygons from the shapefile into a dataframe that gives the
# latitude and longitude for each point defining the ZIP code boundaries
plotData <- broom::tidy(NYC_SHP)
# Add an ID column to cumulative COVID dataset to enable merging with plotData
COVID_cumulative$ID <- unique(plotData$id)
# Merge hospitalization counts with map data. Note that in the plotData dataframe,
# the ZIP codes are in the 'id' column.
plotData <- merge(plotData, reshape2::melt(COVID_cumulative), by.x = "id", by.y = "ID")
# Rename the "variable" and "value" columns of plotData to "Week" and "Count"
colnames(plotData)[which(colnames(plotData) == "variable")] <- "Week"
colnames(plotData)[which(colnames(plotData) == "value")] <- "Count"
# Construct map: id (ZIP code) is the text that'll be shown when hovering over plot
map2 <- ggplot(plotData, aes(x = long, y = lat, text = ZIP)) +
# Draw polygons colored according to the hosp. count, with a black border
geom_polygon(aes(group = group, fill = Count), color = "black", size = 0.2) +
# Put the counts for each week on a separate plot, and arrange plots into a 2-column layout
facet_wrap(~ Week, ncol = 2) +
# Specify purple color palette, with darker shades indicating higher counts
scale_fill_distiller(palette = "Purples", direction = 1) +
# Set plot title
ggtitle("Cumulative COVID-related hospitalizations, March 2020\n") +
# Remove background and axes, and center the plot title
theme_void() + theme(plot.title = element_text(hjust = 0.5))
# Draw the map statically (uncomment the ggplotly call for an interactive version)
map2
# ggplotly(map2)
The above maps help visualize the temporal evolution of spatially-localized outbreaks. In the Jackson Heights/Corona area of Queens, for instance, a severe outbreak seems to have been relatively contained within the neighborhood and the ZIP codes immediately surrounding it. By contrast, outbreaks originating in the Melrose area of the Bronx and East Flatbush in Brooklyn seem to have spread to neighboring ZIP codes somewhat more evenly.
To visualize city-wide variation in hospitalizations over time, we will plot two time series side-by-side:
# Create a dataframe comprised of the dates in March, the hospitalization counts each
# day, and the cumulative hospitalization counts each day. Note that we obtain
# the daily count by taking the sum of each column in NYC_COVID.
plotData <- data.frame(date = dates, count = colSums(NYC_COVID), cumu_count = cumsum(colSums(NYC_COVID)))
# Construct first plot: x-axis is time, y-axis is hospitalizations
p1 <- ggplot(plotData, aes(x = date, y = count)) +
# Draw points and lines, set axis labels
geom_point() + geom_line() + labs(x = "Date", y = "Hospitalization count") +
# Set plot title
ggtitle("COVID-19 hospitalizations in NYC, March 2020") +
# Center the plot title
theme(plot.title = element_text(hjust = 0.5))
# Construct second plot: x-axis is time, y-axis is cumulative hospitalizations
p2 <- ggplot(plotData, aes(x = date, y = cumu_count)) +
# Draw points and lines, set axis labels
geom_point() + geom_line() + labs(x = "Date", y = "Cumulative hospitalization count") +
# Set plot title
ggtitle("Cumulative COVID-19 hosp. in NYC, March 2020") +
# Center the plot title
theme(plot.title = element_text(hjust = 0.5))
# Load package for arranging ggplots on a grid
library(gridExtra)
grid.arrange(p1, p2, ncol=2)
We will next review ways to assess the presence of spatial autocorrelation, or spatial dependence, in our dataset. That is, how similar are ZIP code hospitalization counts to those in neighboring ZIPs?
The Moran’s I-statistic is a measure of spatial dependence that we can use to investigate this question. If \(x_1,...,x_N\) are measurements (e.g., COVID case counts) at \(N\) spatial locations (e.g. ZIP codes), the Moran’s \(I\)-statistic is defined as:
\[\text{Moran's } I \;\;=\;\; \frac{\sum_{i=1}^N \text{covariance between location } i \text{ and its neighbors}}{\text{variance over all locations}} \;\;=\;\; \frac{N\sum_{i=1}^N \sum_{j=1}^N w_{ij}(x_i-\bar x)(x_j-\bar x)}{W\sum_{i=1}^N (x_i-\bar x)^2}\]
where \(\bar x\) is the mean of the measurements, \(w_{ij}\) is a spatial weight that is positive if locations \(i\) and \(j\) are neighbors and zero otherwise (with \(w_{ii} = 0\)), and \(W = \sum_{i=1}^N \sum_{j=1}^N w_{ij}\) is the sum of all the weights.
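To make the formula concrete, here is a toy computation in base R: four locations on a line, binary adjacency weights, and made-up values. This is a sketch of the formula itself, not a substitute for spdep's moran.test():

```r
# Moran's I "by hand" for four locations on a line, with w_ij = 1 for
# adjacent pairs and 0 otherwise (measurements x are made up)
x <- c(1, 2, 4, 8)
w <- matrix(0, 4, 4)
w[cbind(1:3, 2:4)] <- 1  # neighbors: pairs (1,2), (2,3), (3,4)
w[cbind(2:4, 1:3)] <- 1  # make the weight matrix symmetric
N <- length(x); W <- sum(w); xc <- x - mean(x)
N * sum(w * outer(xc, xc)) / (W * sum(xc^2))
## [1] 0.2521739
```

The positive value reflects the fact that adjacent locations have similar values here.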
The Moran’s \(I\)-statistic is between \(-1\) (negative correlation) and \(1\) (positive correlation), and we can test for its statistical significance. Below, we compute the Moran’s \(I\) associated with the total hospitalization count in each ZIP code. First, we need to identify each ZIP’s set of neighbors:
# Load spdep package for the poly2nb function, which will identify neighbors
library(spdep)
# Construct an adjacency matrix to identify each ZIP code's set of neighbors
NYC_neighbors <- poly2nb(NYC_SHP)
We’ll print out the first few entries of the
NYC_neighbors list:
head(NYC_neighbors)
## [[1]]
## [1] 2 3 112 155 157
##
## [[2]]
## [1] 1 102 112 114 157
##
## [[3]]
## [1] 1 5 105 110 112 155
##
## [[4]]
## [1] 5 6 8 9 110 111
##
## [[5]]
## [1] 3 4 7 8 110 112
##
## [[6]]
## [1] 4 9 111 115 123
This tells us that ZIP code #1 has ZIPs 2, 3, 112, 155, and 157 as its neighbors, for instance. We can check which ZIP codes these correspond to:
zips[c(1, NYC_neighbors[[1]])]
## [1] "11213" "11212" "11225" "11203" "11216" "11233"
ZIP code #1 is 11213 (Crown Heights, Brooklyn), and the other ZIP codes immediately surround it.
Further inspection of the NYC_neighbors list reveals
that several ZIPs are not assigned to have any neighbors, which will
produce errors for subsequent calculations. We resolve this by manually
assigning these neighbors. First we check which ZIPs do not have any
neighbors recorded:
zips_without_neighbors <- which(sapply(NYC_neighbors, sum) == 0)
zips[zips_without_neighbors]
## [1] "10044" "11222" "11237" "11385" "10162"
We’ll manually set these neighbors:
NYC_neighbors[[zips_without_neighbors[1]]] <- which(zips %in% c("10022", "10065", "10021", "11101"))
NYC_neighbors[[zips_without_neighbors[2]]] <- which(zips %in% c("11101", "11211", "11378"))
NYC_neighbors[[zips_without_neighbors[3]]] <- which(zips %in% c("11385", "11211", "11206", "11221", "11207"))
NYC_neighbors[[zips_without_neighbors[4]]] <- which(zips %in% c("11379", "11378", "11237", "11207", "11208", "11421", "11375"))
NYC_neighbors[[zips_without_neighbors[5]]] <- which(zips %in% c("10021", "10075"))
# Compute spatial weights: for each ZIP, weights are 1/N, where N = number of neighbors
NYC_neighbors <- nb2listw(NYC_neighbors, zero.policy = TRUE)
# Compute global Moran's I-statistic for total case counts
COVID_moran <- moran.test(rowSums(NYC_COVID), NYC_neighbors, zero.policy = TRUE)
# Compute global Moran's I statistic for cumulative case counts up to each day in March 2020
cumulative_moran <- data.frame(date = dates[-1], statistic = 0, pvalue = 0)
for(i in 2:30) {
COVID_moran_day <- moran.test(rowSums(NYC_COVID[, 1:i]), NYC_neighbors, zero.policy = TRUE)
cumulative_moran[i-1, "statistic"] <- COVID_moran_day$estimate[1]
cumulative_moran[i-1, "pvalue"] <- COVID_moran_day$p.value
}
ggplot(cumulative_moran, aes(x = date, y = statistic)) + geom_point() + geom_line() +
labs(x="Date", y="Moran's I statistic", title="Moran's I statistic for cumulative hospitalizations up to each day") + theme(plot.title=element_text(hjust=0.5))
# Local Moran's I statistic
cumulative_local_moran_stat <- data.frame(matrix(0, nrow = length(zips), ncol = length(dates)-1))
colnames(cumulative_local_moran_stat) <- dates[-1]
rownames(cumulative_local_moran_stat) <- zips
cumulative_local_moran_pvalue <- cumulative_local_moran_stat
for(i in 2:30) {
COVID_moran_day <- localmoran(rowSums(NYC_COVID[, 1:i]) / NYC_Populations$pop, NYC_neighbors, zero.policy = TRUE)
cumulative_local_moran_stat[, i-1] <- COVID_moran_day[,1]
cumulative_local_moran_pvalue[, i-1] <- COVID_moran_day[,5]
}
# Plot local Moran's statistics for 11368 (Jackson Heights/Corona), 11203 (East Flatbush), 10002 (LES), 10451 (Melrose)
plotData <- cumulative_local_moran_stat[c("10002", "11203", "11368", "10451"),]
rownames(plotData) <- c("10002 (Lower East Side, Manhattan)", "11203 (East Flatbush, Brooklyn)", "11368 (Jackson Heights/Corona, Queens)", "10451 (Melrose, Bronx)")
plotData <- reshape2::melt(t(plotData))
colnames(plotData) <- c("Date", "zip", "I")
p1 <- ggplot(plotData, aes(x = as.Date(Date), y = I)) + geom_point() + geom_line() + facet_wrap(~ zip, nrow=1) +
labs(x = "Date", y = "Local Moran's I statistic", title = "Local Moran's I statistic for cumulative hospitalizations up to each day in select ZIP codes") + theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 45, hjust = 1))
# Plot the corresponding p-values
plotData <- cumulative_local_moran_pvalue[c("10002", "11203", "11368", "10451"),]
rownames(plotData) <- c("10002 (Lower East Side, Manhattan)", "11203 (East Flatbush, Brooklyn)", "11368 (Jackson Heights/Corona, Queens)", "10451 (Melrose, Bronx)")
plotData <- reshape2::melt(t(plotData))
colnames(plotData) <- c("Date", "zip", "p")
p2 <- ggplot(plotData, aes(x = as.Date(Date), y = p)) + geom_point() + geom_line() + facet_wrap(~ zip, nrow = 1) +
labs(x = "Date", y = "Local Moran's I p-value", title = "Local Moran's I p-value") +
geom_hline(yintercept = 0.05, linetype = "dashed", color = "blue") + theme(plot.title=element_text(hjust = 0.5), axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(p1, p2, nrow = 2)
We can also identify the spatial clustering pattern of each area and plot it on a choropleth.
COVID_cumulative_LISA <- attr(localmoran(rowSums(NYC_COVID) / NYC_Populations$pop, NYC_neighbors, zero.policy = TRUE), "quadr")["mean"]
COVID_cumulative_LISA$zip <- zips
colnames(COVID_cumulative_LISA)[1] <- "LISA"
# Mark non-significant local statistics
COVID_cumulative_LISA$LISA <- as.character(COVID_cumulative_LISA$LISA)
COVID_cumulative_LISA$LISA[which(cumulative_local_moran_pvalue[, 29] > 0.05)] <- "Not significant"
# As before, convert spatial polygons from the shapefile into a longitude-latitude
# dataframe and merge the clustering results
plotData <- broom::tidy(NYC_SHP)
## Regions defined for each Polygons
COVID_cumulative_LISA$ID <- unique(plotData$id)
plotData <- merge(plotData, COVID_cumulative_LISA, by.x = "id", by.y = "ID")
# Draw the map
ggplot(plotData, aes(x = long, y = lat, text = id)) +
geom_polygon(aes(group = group, fill = LISA), color = "black", size = 0.2) +
ggtitle("Spatial clusters of COVID-19 hospitalizations, March 2020") +
scale_fill_manual(values = c("indianred3", "lightpink", "lightskyblue3", "blue", "white")) +
theme_void() + theme(plot.title = element_text(hjust = 0.5))
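For intuition, the quadrant labels above ("High-High", "Low-Low", etc.) come from comparing each location's centered value with the centered average of its neighbors' values. A rough base-R sketch with made-up values (spdep's classification has additional variants, e.g. mean- vs. median-based, so this only illustrates the idea):

```r
# Toy LISA quadrant classification: "High-High" means both the location and
# the average of its neighbors are above the overall mean, and so on
x   <- c(10, 12, 1, 2)  # made-up rates at four locations on a line
lag <- c(x[2], mean(c(x[1], x[3])), mean(c(x[2], x[4])), x[3])  # neighbor averages
xc <- x - mean(x); lc <- lag - mean(x)
ifelse(xc >= 0,
       ifelse(lc >= 0, "High-High", "High-Low"),
       ifelse(lc >= 0, "Low-High", "Low-Low"))
```

Here the first location comes out as a "High-High" cluster and the last as a "Low-Low" cluster.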
We begin by fitting a model of the total hospitalization counts in each ZIP code, so there is no temporal component just yet:
\[\log(\rho_i) = \alpha + \beta_i\]
where \(\rho_i\) is the hospitalization rate in ZIP code \(i\) (so that the expected count is \(\rho_i\) times the ZIP's population), \(\alpha\) is a city-wide intercept, and \(\beta_i\) is a ZIP-specific random effect.
Specifically, \(\beta_i = \beta_{1,i} + \beta_{2,i}\), where \(\beta_{1,i}\) is a spatially-structured residual and \(\beta_{2,i}\) is an unstructured residual. The unstructured residual \(\beta_{2,i}\) is assumed to have a \(N(0, \sigma^2_2)\) distribution, independently across ZIPs. The structured residual \(\beta_{1,i}\) is assumed to have the following conditional distribution:
\[\begin{gather*} \beta_{1,i}|\beta_{1,j\neq i} \sim N(m_i, s_i^2) \\ \text{where } m_i = \text{average of } \beta_{1,j} \text{ over neighboring indices } j \\ \text{and } s_i^2 = \sigma^2_1 \,/\, (\text{number of neighbors of } i) \end{gather*}\]
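The conditional mean \(m_i\) is simply the average of the current structured residuals over ZIP \(i\)'s neighbors. A small base-R illustration with a hypothetical four-ZIP adjacency list and made-up residuals:

```r
# For each location i, m_i = average of beta_1 over i's neighbors
beta1 <- c(0.5, -0.2, 0.1, 0.4)                # made-up structured residuals
nbrs  <- list(c(2, 3), c(1, 3), c(1, 2, 4), 3) # hypothetical adjacency list
m <- sapply(nbrs, function(j) mean(beta1[j]))
round(m, 3)
## [1] -0.050  0.300  0.233  0.100
```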
library(INLA)
## Loading required package: Matrix
## This is INLA_24.02.09 built 2024-02-09 03:43:24 UTC.
## - See www.r-inla.org/contact-us for how to get help.
## - List available models/likelihoods/etc with inla.list.models()
## - Use inla.doc(<NAME>) to access documentation
# Gather the data needed for the model
modelData <- data.frame(zipID = 1:length(zips), count = rowSums(NYC_COVID), population = NYC_Populations$pop)
# Create list of each ZIP’s neighbors (this writes a file into the working directory)
nb2INLA("NYC.graph", poly2nb(NYC_SHP))
# Define the formula for this model
model0Formula <- count ~ 1 + f(zipID, model = "bym", graph = "NYC.graph")
# Fit the model using the INLA algorithm
spatialModel <- inla(model0Formula, family = "poisson", offset = log(population), data = modelData)
# Look at exponentiated coefficients to get city-wide rate
exp(spatialModel$summary.fixed)
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 0.01217149 1.030081 0.0114818 0.01217186 0.01290044 0.01217186 1
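The intercept is interpretable as a per-person rate because of the log(population) offset: \(\exp(\alpha)\) multiplies the population to give the expected count. As a sanity check of this interpretation (using stats::glm rather than INLA, on simulated data), the exponentiated intercept recovers the rate used in the simulation:

```r
# Poisson regression with a log-population offset: exp(intercept) estimates
# the per-person rate (the simulation truth here is 0.01)
set.seed(1)
pop    <- rep(c(2e4, 5e4, 1e5), each = 20)        # made-up ZIP populations
counts <- rpois(length(pop), lambda = 0.01 * pop) # simulated counts
fit <- glm(counts ~ 1, family = poisson, offset = log(pop))
exp(coef(fit))  # close to 0.01
```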
The \(f(\cdot)\) formulation allows
us to specify the structure of the term inside it. The default option is
iid, but we can see all the other possible options as
follows:
names(inla.models()$latent)
## [1] "linear" "iid" "mec" "meb" "rgeneric"
## [6] "cgeneric" "rw1" "rw2" "crw2" "seasonal"
## [11] "besag" "besag2" "bym" "bym2" "besagproper"
## [16] "besagproper2" "fgn" "fgn2" "ar1" "ar1c"
## [21] "ar" "ou" "intslope" "generic" "generic0"
## [26] "generic1" "generic2" "generic3" "spde" "spde2"
## [31] "spde3" "iid1d" "iid2d" "iid3d" "iid4d"
## [36] "iid5d" "iidkd" "2diid" "z" "rw2d"
## [41] "rw2diid" "slm" "matern2d" "dmatern" "copy"
## [46] "scopy" "clinear" "sigm" "revsigm" "log1exp"
## [51] "logdist"
We also used family = "poisson" above, but there are
many other options for the distribution of the data (likelihood):
names(inla.models()$likelihood)
## [1] "fl" "poisson"
## [3] "nzpoisson" "xpoisson"
## [5] "cenpoisson" "cenpoisson2"
## [7] "gpoisson" "poisson.special1"
## [9] "0poisson" "0poissonS"
## [11] "bell" "0binomial"
## [13] "0binomialS" "binomial"
## [15] "xbinomial" "pom"
## [17] "bgev" "gamma"
## [19] "gammasurv" "gammajw"
## [21] "gammajwsurv" "gammacount"
## [23] "qkumar" "qloglogistic"
## [25] "qloglogisticsurv" "beta"
## [27] "betabinomial" "betabinomialna"
## [29] "cbinomial" "nbinomial"
## [31] "nbinomial2" "cennbinomial2"
## [33] "simplex" "gaussian"
## [35] "stdgaussian" "gaussianjw"
## [37] "agaussian" "ggaussian"
## [39] "ggaussianS" "rcpoisson"
## [41] "tpoisson" "circularnormal"
## [43] "wrappedcauchy" "iidgamma"
## [45] "iidlogitbeta" "loggammafrailty"
## [47] "logistic" "sn"
## [49] "gev" "lognormal"
## [51] "lognormalsurv" "exponential"
## [53] "exponentialsurv" "coxph"
## [55] "weibull" "weibullsurv"
## [57] "loglogistic" "loglogisticsurv"
## [59] "stochvol" "stochvolsn"
## [61] "stochvolt" "stochvolnig"
## [63] "zeroinflatedpoisson0" "zeroinflatedpoisson1"
## [65] "zeroinflatedpoisson2" "zeroinflatedcenpoisson0"
## [67] "zeroinflatedcenpoisson1" "zeroinflatedbetabinomial0"
## [69] "zeroinflatedbetabinomial1" "zeroinflatedbinomial0"
## [71] "zeroinflatedbinomial1" "zeroinflatedbinomial2"
## [73] "zeroninflatedbinomial2" "zeroninflatedbinomial3"
## [75] "zeroinflatedbetabinomial2" "zeroinflatednbinomial0"
## [77] "zeroinflatednbinomial1" "zeroinflatednbinomial1strata2"
## [79] "zeroinflatednbinomial1strata3" "zeroinflatednbinomial2"
## [81] "t" "tstrata"
## [83] "nmix" "nmixnb"
## [85] "gp" "dgp"
## [87] "logperiodogram" "tweedie"
## [89] "fmri" "fmrisurv"
## [91] "gompertz" "gompertzsurv"
Now we will plot the ZIP code-specific relative risks of hospitalization on a map.
# Plot the ZIP-specific relative risks of hospitalization
zipResiduals <- data.frame(zip = zips, resid = 0)
# Extract each ZIP's value (the sum of the structured and unstructured parts)
for(i in 1:length(zips))
zipResiduals[i, "resid"] <- sum(spatialModel$summary.random$zipID$mean[c(i, i + length(zips))])
# Exponentiate the coefficient
zipResiduals$resid <- exp(zipResiduals$resid)
# Draw the map
plotData <- broom::tidy(NYC_SHP)
## Regions defined for each Polygons
zipResiduals$ID <- unique(plotData$id)
plotData <- merge(plotData, zipResiduals, by.x = "id", by.y = "ID")
plotData <- plotData[- which(plotData$zip == "11040"), ] # Remove outlying ZIP from plot
ggplot(plotData, aes(x = long, y = lat, text = zip)) +
geom_polygon(aes(group = group, fill = resid), color = "black", size = 0.2) +
scale_fill_distiller(palette = "Purples", direction = 1) +
ggtitle("ZIP-specific multiplicative factors on city-wide hospitalization rate") +
theme_void() + theme(plot.title = element_text(hjust = 0.5))
# Reshape data into long format (one row per ZIP-day)
modelData <- reshape2::melt(t(as.matrix(NYC_COVID)))
colnames(modelData) <- c("Date", "ZIP", "count")
modelData$zipID <- rep(1:length(zips), each = length(dates))
# Add two time counters: one will be for random walk term, the other will be for iid error term
modelData$time <- modelData$time2 <- rep(1:length(dates), length(zips))
# Add population data
modelData$population <- rep(NYC_Populations$pop, each = length(dates))
# Create list of each ZIP’s neighbors (this writes a file into the working directory)
nb2INLA("NYC.graph", poly2nb(NYC_SHP))
# Define the formula for this model
model1Formula <- count ~ 1 + f(zipID, model = "bym", graph = "NYC.graph") + f(time, model = "rw1") + f(time2, model = "iid")
# Fit the model using the INLA algorithm
spatiotemporalModel <- inla(model1Formula, family = "poisson", offset = log(population), data = modelData)
# Look at exponentiated coefficients to get city-wide daily rate
exp(spatiotemporalModel$summary.fixed)
## mean sd 0.025quant 0.5quant 0.975quant
## (Intercept) 0.0003778969 1.030167 0.000356414 0.0003779098 0.0004005968
## mode kld
## (Intercept) 0.0003779098 1
# To get a particular ZIP code's fitted values over time, we need to add up the
# two spatial components (structured and unstructured; ZIP-specific) and the two temporal
# components. We will do this for the four ZIP codes whose local Moran's I statistics
# we looked at previously.
outbreakZips <- c("10002", "11203", "11368", "10451")
zipIDs <- which(zips %in% outbreakZips)
zipFittedValues <- matrix(0, nrow = length(dates), ncol = length(outbreakZips), dimnames = list(as.character(dates), outbreakZips))
for(i in 1:length(outbreakZips)) {
spatialComponent <- spatiotemporalModel$summary.random$zipID$mean[zipIDs[i]]
temporalComponent1 <- spatiotemporalModel$summary.random$time$mean
temporalComponent2 <- spatiotemporalModel$summary.random$time2$mean
zipFittedValues[, i] <- exp(spatialComponent + temporalComponent1 + temporalComponent2)
}
# Plot the fitted hospitalization rates over time. Notice that they are vertically shifted
# or dampened versions of the same mean pattern, which is what we'd expect from the
# additive nature of how they're constructed (mean time series + spatial terms)
plotData <- reshape2::melt(zipFittedValues)
colnames(plotData) <- c("Date", "ZIP", "Rate")
plotData$ZIP <- as.factor(plotData$ZIP)
ggplot(plotData, aes(x = Date, y = Rate, color = ZIP)) +
geom_line(aes(group = ZIP)) + scale_color_manual(values = c("red", "forestgreen", "blue", "orange")) +
theme_bw() + labs(title = "Estimated hospitalization rates in outbreak ZIP codes ") +
theme(plot.title=element_text(hjust = 0.5), axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "bottom")
Forecasting future observations in time can be viewed as fitting the same INLA model with some missing data in the response and computing the predictive distribution of those points. Below, we add missing values for the next week of the COVID dataset and re-fit the spatiotemporal model.
# Create new dataset with NA values for 7 days for all ZIP codes
new_data_matrix <- matrix(NA, nrow = nrow(NYC_COVID), ncol = 7)
rownames(new_data_matrix) <- rownames(NYC_COVID)
colnames(new_data_matrix) <- as.character(seq(from = dates[length(dates)] + 1, to = dates[length(dates)] + ncol(new_data_matrix), by = 1))
NYC_COVID_new <- cbind(NYC_COVID, new_data_matrix)
# Reshape data into long format, and add time counters and population data
modelData <- reshape2::melt(t(as.matrix(NYC_COVID_new)))
colnames(modelData) <- c("Date", "ZIP", "count")
modelData$zipID <- rep(1:length(zips), each = ncol(NYC_COVID_new))
modelData$time <- modelData$time2 <- rep(1:ncol(NYC_COVID_new), length(zips))
modelData$population <- rep(NYC_Populations$pop, each = ncol(NYC_COVID_new))
# Define the formula for this model and fit using the INLA algorithm
model1Formula <- count ~ 1 + f(zipID, model = "bym", graph = "NYC.graph") + f(time, model = "rw1") + f(time2, model = "iid")
spatiotemporalModel <- inla(model1Formula, family = "poisson", offset = log(population), data = modelData,
control.predictor = list(compute = TRUE, link = 1))